Assessing binary classifiers using only positive and unlabeled data

Authors

Marc Claesen

Jesse Davis

Frank De Smet

Bart De Moor

Abstract

Assessing the performance of a learned model is a crucial part of machine learning. Most evaluation metrics can only be computed with labeled data. Unfortunately, in many domains we have many more unlabeled than labeled examples. Furthermore, in some domains only positive and unlabeled examples are available, in which case most standard metrics cannot be computed at all. In this paper, we propose an approach that is able to estimate several widely used metrics, including ROC and PR curves, using only positive and unlabeled data. We provide theoretical bounds on the quality of our estimates. Empirically, we demonstrate that even given only a small number of positive examples and unlabeled data, we are able to reliably estimate both ROC and PR curves.
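The abstract does not spell out the estimator, so the following is only a minimal sketch of one standard way to recover a ROC point from positive and unlabeled scores, not the authors' method. It assumes the known positives are a random sample of all positives and that the fraction `beta` of positives hidden in the unlabeled set is known. Under those assumptions, the fraction of unlabeled examples predicted positive equals `beta * TPR + (1 - beta) * FPR`, which can be solved for the false positive rate. The names `pu_roc_point`, `pu_roc_curve`, and `beta` are hypothetical and chosen for illustration.

```python
def pu_roc_point(pos_scores, unl_scores, threshold, beta):
    """Estimate (FPR, TPR) at one threshold from positive and unlabeled scores.

    beta is the ASSUMED fraction of positives hidden in the unlabeled set.
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    # TPR is estimated directly from the known positives.
    tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    # Fraction of unlabeled examples that the classifier predicts positive.
    unl_rate = sum(s >= threshold for s in unl_scores) / len(unl_scores)
    # unl_rate = beta * TPR + (1 - beta) * FPR, solved for FPR.
    fpr = (unl_rate - beta * tpr) / (1.0 - beta)
    # Sampling noise can push the estimate slightly outside [0, 1]; clip it.
    fpr = min(1.0, max(0.0, fpr))
    return fpr, tpr


def pu_roc_curve(pos_scores, unl_scores, beta):
    """Sweep every observed score as a threshold, from highest to lowest."""
    thresholds = sorted(set(pos_scores) | set(unl_scores), reverse=True)
    return [pu_roc_point(pos_scores, unl_scores, t, beta) for t in thresholds]


if __name__ == "__main__":
    # Toy scores, made up purely for illustration.
    pos = [0.9, 0.8, 0.7, 0.4]
    unl = [0.95, 0.6, 0.3, 0.2, 0.1, 0.85, 0.15, 0.05, 0.25, 0.35]
    print(pu_roc_point(pos, unl, threshold=0.5, beta=0.2))
```

The estimate is only as good as the assumed `beta`. If that fraction is uncertain, running the sketch with a lower and an upper guess gives a rough range for each point on the curve.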

Similar resources

Learning Classifiers from Imbalanced, Only Positive and Unlabeled Data Sets

In this report, I present my results for the tasks of the 2008 UC San Diego Data Mining Contest. This contest consists of two classification tasks based on data from a scientific experiment. The first task is a binary classification task whose goal is to maximize classification accuracy on an evenly distributed test data set, given a fully labeled, imbalanced training data set. The second task is also ...

Optimally Combining Classifiers Using Unlabeled Data

We develop a worst-case analysis of aggregation of classifier ensembles for binary classification. The task of predicting to minimize error is formulated as a game played over a given set of unlabeled data (a transductive setting), where prior label information is encoded as constraints on the game. The minimax solution of this game identifies cases where a weighted combination of the classifie...

Learning to Rank Biomedical Documents with only Positive and Unlabeled Examples: A Case Study

In the text mining field, obtaining training data requires labeling effort from human experts, which is often time-consuming and expensive. Supervised learning with only a small number of positive examples and a large amount of unlabeled data, which is easy to get, has attracted growing interest in the field. A recently proposed relabeling method, which assumes unlabeled data as negative data for...

Cool Blog Classification from Positive and Unlabeled Examples

We address the problem of cool blog classification using only positive and unlabeled examples. We propose an algorithm, called PUB, that exploits the information of unlabeled data together with the positive examples to predict whether the unseen blogs are cool or not. The algorithm uses the weighting technique to assign a weight to each unlabeled example which is assumed to be negative in the t...

Named Entity Disambiguation in Streaming Data

The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and unique real-world entities. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to obtain in streaming environments, since the trainin...

Journal title:

CoRR

Volume abs/1504.06837

Publication date 2015
